Malware detection using pattern classification

ABSTRACT

A malware classifier uses features of suspect software to classify the software as malicious or not. The classifier uses a pattern classification algorithm to statistically analyze computer software. The classifier takes a feature representation of the software and maps it to the classification label with the use of a trained model. The feature representation of the input computer software includes the relevant features and the values of each feature. These features include the categories of: applicable software characteristics of a particular type of malware; dynamic link library (DLL) and function name strings typically occurring in the body of the malware; and other alphanumeric strings commonly found in malware. By providing these features and their values to the classifier, the classifier is better able to identify a particular type of malware.

FIELD OF THE INVENTION

The present invention relates generally to addressing malicious software in computer systems. More specifically, the present invention relates to malware detection using a pattern classification algorithm based upon features of the malware.

BACKGROUND OF THE INVENTION

Currently, it is common for malicious software such as computer viruses, worms, spyware, etc., to affect a computer such that it will not behave as expected. Malicious software can delete files, slow computer performance, clog e-mail accounts, steal confidential information, cause computer crashes, allow unauthorized access and generally perform other actions that are undesirable or not expected by the user of the computer.

Current technology allows computer users to create backups of their computer systems and of their files and to restore their computer systems and files in the event of a catastrophic failure such as a loss of power, a hard drive crash or a system operation failure. Assuming that the user had performed a backup prior to the failure, it can be straightforward to restore their computer system and files to a state prior to the computer failure. Unfortunately, these prior art techniques are not effective when dealing with infection of a computer by malicious software. It is important to be able to detect such malware when it first becomes present in a computer system, or better yet, before it can be transferred to a user's computer.

Prior art techniques able to detect known malware use a predefined pattern database that compares a known pattern with suspected malware. This technique, though, is unable to handle new, unknown malware. Other prior art techniques use predefined rules or heuristics to detect unknown malware. These rules take into account some characteristics of the malware, but these rules need to be written down manually and are hard to maintain. Further, it can be very time-consuming and difficult to attempt to record all of the rules necessary to detect many different kinds of malware. Because the number of rules is often limited, this technique cannot achieve both a high detection rate and a low false-positive rate.

Given the above deficiencies in the prior art in being able to detect unknown malware efficiently, a suitable solution is desired.

SUMMARY OF THE INVENTION

To achieve the foregoing, and in accordance with the purpose of the present invention, a malware classifier is disclosed that uses features of suspect software to classify the software as malicious or not. The present invention provides the ability to detect a high percentage of unknown malware with a very low false-positive rate.

A malware classifier uses a pattern classification algorithm to statistically analyze computer software in order to categorize it by giving it a classification label. Any suspect computer software is input to the malware classifier with the resulting output being a label that identifies the software as benign, normal software or as a particular type of malicious software. The classifier takes a feature representation of the software and maps it to the classification label with the use of a trained model, or function definition.

The feature representation of the input computer software includes the relevant features and the values of each feature. These features include the categories of: applicable software characteristics of a particular type of malware; dynamic link library (DLL) and function name strings typically occurring in the body of the malware; and other alphanumeric strings commonly found in malware. By providing these features and their values to the classifier, the classifier is better able to identify a particular type of malware.

One embodiment is a method for training a malware classifier. A feature definition file is created that includes features relevant to the identification of the type of malware. Software training data is selected that includes known malware as well as benign software. A training application is executed that outputs a trained model for identifying the particular type of malware.

A second embodiment is a method for classifying suspect software. First, a group of features relevant to a particular type of malware is selected along with a trained model that has been trained to identify the same type of malware. The malware classifier extracts features and their values from suspect software and inputs them to a classification algorithm. The classification algorithm outputs a classification label for the suspect software, identifying it as malware or as benign.

A third embodiment is a malware classifier apparatus. The apparatus includes a feature definition file having features known to be associated with the type of malware, a model trained to identify that malware, a feature extraction module and a pattern classification algorithm. In one specific embodiment, the classification algorithm is the support vector machine (SVM) algorithm.

BRIEF DESCRIPTION OF THE DRAWINGS

The invention, together with further advantages thereof, may best be understood by reference to the following description taken in conjunction with the accompanying drawings in which:

FIG. 1 is a block diagram of a malware classifier according to one embodiment of the invention.

FIG. 2 illustrates the header of a file in portable executable format.

FIG. 3 is a table illustrating the use of function names as features as well as alphanumeric strings.

FIG. 4 illustrates a list of features and their values from a real-world worm.

FIG. 5 illustrates a hyper plane used in the SVM algorithm.

FIG. 6 illustrates a situation in which the training samples are not linearly separable.

FIG. 7 is a flow diagram describing the classification of computer software.

FIG. 8 is a block diagram illustrating the creation of a trained model.

FIG. 9 is a flow diagram describing training of the classification algorithm and the creation of a trained model.

FIGS. 10A-10F show portions of a feature definition file.

FIG. 11 is an example showing a trained model output by the training application for the purposes of detecting a computer worm.

FIGS. 12A and 12B illustrate a computer system suitable for implementing embodiments of the present invention.

DETAILED DESCRIPTION OF THE INVENTION

The present invention is applicable to all malicious software, or malware, that generally causes harm to a computer system, provides an effect that is not expected by the user, is undesirable, illegal, or otherwise causes the user to want to restore their computer system from a time prior to when it was infected by the malware. Malware can be classified based upon how it is executed, how it spreads or what it does. The below descriptions are provided as guidelines for the types of malware currently existing; these classifications are not perfect in that many groups overlap. Of course, later developed software not currently known may also fall within the definition of malware.

When computer viruses first originated, common targets were executable files and the boot sectors of floppy disks; later targets were documents that contain macro scripts, and more recently, many computer viruses have embedded themselves in e-mail as attachments. With executable files, the virus arranges that when the host code is executed the virus code is executed as well. Normally, the host program continues to function after it is infected by the virus. Some viruses overwrite other programs with copies of themselves, thus destroying the program. Viruses often spread across computers when the software or document to which they are attached is transferred from one computer to another. Computer worms are similar to viruses but are stand-alone software and thus do not require host files or other types of host code to spread themselves. They do modify the host operating system, however, at least to the extent that they are started as part of the boot process. In order to spread, worms either exploit some vulnerability of the target host or use some kind of social engineering to trick users into executing them.

A Trojan horse program is a harmful piece of software that is often disguised as legitimate software. Trojan horses cannot replicate themselves, unlike viruses or worms. A Trojan horse can be deliberately attached to otherwise useful software by a programmer, or can be spread by tricking users into believing that it is useful. Some Trojan horses can spread or activate other malware, such as viruses (a dropper). A wabbit is a third, uncommon type of self-replicating malware. Unlike viruses, wabbits do not infect host programs or documents. And unlike worms, wabbits do not use network functionality to spread to other computers. A simple example of a wabbit is a fork bomb.

Spyware is a piece of software that collects and sends information (such as browsing patterns or credit card numbers) about users and the results of their computer activity without explicit notification. Spyware usually works and spreads like Trojan horses. The category of spyware may also include adware that a user deems undesirable. A backdoor is a piece of software that allows access to the computer system by bypassing the normal authentication procedures. There are two groups of backdoors depending upon how they work and spread. The first group works much like a Trojan horse, i.e., they are manually inserted into another piece of software, executed via their host software and spread by the host software being installed. The second group works more like a worm in that they get executed as part of the boot process and are usually spread by worms carrying them as their payload. The term ratware has arisen to describe backdoor malware that turns computers into zombies for sending spam.

An exploit is a piece of software that attacks a particular security vulnerability. Exploits are not necessarily malicious in intent; they are often devised by security researchers as a way of demonstrating that a vulnerability exists. They are, however, a common component of malicious programs such as network worms. A root kit is software inserted onto a computer system after an attacker has gained control of the system. Root kits often include functions to hide the traces of the attack, as by deleting logged entries or by cloaking the attacker's processes. Root kits might include backdoors, allowing the attacker to easily regain access later or to exploit software to attack other systems. Because they often hook into the operating system at the kernel level to hide their presence, root kits can be very hard to detect.

Key logger software is software that copies a computer user's keystrokes to a file which it may send to a hacker at a later time. Often the key logger software will only awaken when a computer user connects to a secure web site such as a bank. It then logs the keystrokes, which may include account numbers, PINs and passwords, before they are encrypted by the secure web site. A dialer is a program that replaces the telephone number in a modem's dial-up connection with a long-distance number (often out of the country) in order to run up telephone charges on pay-per-dial numbers, or dials out at night to send key logger or other information to a hacker. Software known as URL injection software modifies a browser's behavior with respect to some or all domains. It modifies the URL submitted to the server to profit from a given scheme by the content provider of the given domain. This activity is often transparent to the user.

The present invention is suitable for use with a wide variety of types and formats of malware. The below description provides an example of the use of the invention with malware written in the portable executable (PE) format. As is known in the art, the portable executable format is an executable file format used in 32-bit and 64-bit versions of Microsoft operating systems. The portable executable format is a modified version of the UNIX COFF file format. Of course, the present invention applies to computer files in other formats as well.

Malware Classifier

A malware classifier is a software application that uses a pattern classification algorithm to statistically analyze computer software in order to categorize it by giving it a classification label. Any suspect computer software may be input to the malware classifier with the resulting output being a label that identifies the software as benign, normal software or as a particular type of malicious software. The classifier takes a feature representation of the software and maps it to the classification label with the use of a trained model, or function definition.

FIG. 1 is a block diagram of a malware classifier 100 according to one embodiment of the invention. Input to classifier 100 is computer software 110 which is suspected of being malware. A feature definition file 120 lists all relevant features of any potential computer software and the corresponding attributes for each feature. Feature extraction module 125 is computer software that extracts values for the defined features from the input computer software 110. Trained model 130 is the trained classification function in the form of a computer file that is output by a separate training application as described below. Model 130 is trained by mapping a vector of features into one of several classes by looking at many input-output examples. Pattern classification algorithm 140 is any suitable pattern classification algorithm that accepts feature values and the trained model as input and outputs a classification label 150, or class, for the input computer software 110. Classification algorithm 140 is designed to approximate the behavior of the trained model.

As alluded to above, an effective malware classifier relies upon a suitable classification algorithm, a set of features, feature normalization methods, and training samples (i.e., examples of benign software and malware).

Software Features

Current technologies for detecting malware include noting malicious behavior such as an abnormal TCP connection on a given port or the adding of a registry key that automatically loads itself when the operating system starts. Certain types of malware, however, have behaviors that can be difficult to track. A worm, for example, can create processes with different names on different machines and can behave differently on different machines, all of which make its behavior difficult to track.

But each type of malware exhibits a certain pattern which is different from that of benign computer software. A worm, for example, is likely to call RegCreateKey and RegSetValue, to add an entry in HKLM\Software\Microsoft\CurrentVersion\Run, and to call connect or CopyFile or CreateFile in order to propagate itself. Plus, most of the effort expended by the worm involves propagating itself and damaging files, so there are not many calls to GDI functions or to Common Controls functions. Further, the header of a worm written in a portable executable format will have certain characteristics. Each of the other various types of malware (such as viruses, spyware, adware, etc.) also will have distinctive characteristics and will exhibit distinctive behavior. It is therefore realized that a known pattern classification algorithm may be used to analyze these features of computer software suspected of being malware and to output a result that classifies the computer software as benign or as a particular type of malware.

In one embodiment of the invention, a specific feature definition file is used to classify each type of malware. For example, if it is decided to implement a malware classifier that will detect computer worms then a feature definition file is constructed having specific features relevant to computer worms. On the other hand, if it is desired to detect spyware, a separate feature definition file is used including features known to be present in spyware. Of course, if the goal is to detect computer worms, then training data is supplied to the training application (described below) having examples of computer worms and benign software. The resulting trained model is tuned specifically to detect computer worms and is used in conjunction with a feature definition file containing worm features.

In an alternative embodiment, it is possible that a single feature definition file may be used to detect two or more types of malware. For example, two sets of features identifying two types of malware are combined into one large feature set. Assume features f0, f1 and f2 are for detecting malware type #1, and that features f3, f4 and f5 are for detecting malware type #2. The combined feature set f0, f1, f2, f3, f4 and f5 is used to detect malware types #1 and #2 by using a classification algorithm that combines the logic of the classification functions for detecting malware types #1 and #2.

Feature definition file 120 lists all of the relevant features and the attributes of each feature that might possibly be encountered in computer software 110. These features include the categories of: applicable software characteristics; dynamic link library (DLL) and function name strings occurring in the body of the software; and other strings commonly found in malware. Other types of features may also be used.

In this embodiment, the applicable software characteristics include the fields of the header of a file in portable executable format. For example, these fields are: packed, packer, number of sections, image size, code section size, import table size, export table size, resource size, subsystem, initialized section size, uninitialized section size, image base address, and entry point location. FIG. 2 illustrates the header 210 of a file in portable executable format. Shown is relevant header information that contains suitable characteristics to use as features. Of course, header 210 is specific to a portable executable format; other file types will have other relevant header information and characteristics.
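As an illustration only, the following Python sketch reads several of the listed header fields directly from the bytes of a portable executable file. The function name and the dictionary keys are hypothetical and are not taken from the feature definition file; the field offsets follow the published PE/COFF layout. Fields such as packed, packer and the table sizes are omitted because they require further parsing.

```python
import struct

def pe_header_features(data: bytes) -> dict:
    """Read a few header fields from the bytes of a PE file (illustrative names)."""
    if data[:2] != b"MZ":
        raise ValueError("not a PE file: missing MZ signature")
    # The offset of the PE signature (e_lfanew) is stored at 0x3C in the DOS header.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if data[pe_offset:pe_offset + 4] != b"PE\0\0":
        raise ValueError("not a PE file: missing PE signature")
    coff = pe_offset + 4   # 20-byte COFF file header
    opt = coff + 20        # optional header follows the COFF header
    (num_sections,) = struct.unpack_from("<H", data, coff + 2)
    (magic,) = struct.unpack_from("<H", data, opt)
    code_size, init_size, uninit_size, entry_point = struct.unpack_from(
        "<4I", data, opt + 4)
    if magic == 0x20B:     # PE32+ (64-bit): 8-byte ImageBase at offset 24
        (image_base,) = struct.unpack_from("<Q", data, opt + 24)
    else:                  # PE32: 4-byte ImageBase at offset 28
        (image_base,) = struct.unpack_from("<I", data, opt + 28)
    (image_size,) = struct.unpack_from("<I", data, opt + 56)
    (subsystem,) = struct.unpack_from("<H", data, opt + 68)
    return {
        "number_of_sections": num_sections,
        "code_section_size": code_size,
        "initialized_section_size": init_size,
        "uninitialized_section_size": uninit_size,
        "entry_point_location": entry_point,
        "image_base_address": image_base,
        "image_size": image_size,
        "subsystem": subsystem,
    }
```

A caller would pass the full contents of the file, for example the result of reading it in binary mode, and use the returned values as the header-derived feature values.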

Another category of features includes dynamic link library (DLL) and function name strings occurring in the body of the software. This category enumerates DLL name strings and function name strings that might be imported by suspected malware. In this particular embodiment, the enumerated strings are those that might be used by malware in a portable executable format. Each name string is considered a feature and the value of each of these features will either be one or zero depending upon whether the name string occurs in the body of the suspect computer software. For example, consider kernel32.dll, comctl32.dll, urlmon.dll, shell32.dll, advapi32.dll, InterlockedIncrement and GetThreadLocale as features Fk, Fc, Fu, Fs, Fa, Fi and Fg, respectively. For a given suspect computer software, if only the strings “advapi32.dll” and “GetThreadLocale” are found in its body, then the values of Fa and Fg are each one while the other values are all zero. Other possible functions include RegDeleteValue, RegEnumValue, CreateThread and CreatePipe, etc.
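The one-or-zero name-string features can be sketched in Python as follows. This is a minimal sketch that assumes an exact, case-sensitive byte match against the body of the file; the matching rule actually used by the feature extraction module is not stated, and the function name is hypothetical.

```python
# Feature names Fk..Fg and their strings are taken from the example in the text.
FEATURE_STRINGS = {
    "Fk": b"kernel32.dll",
    "Fc": b"comctl32.dll",
    "Fu": b"urlmon.dll",
    "Fs": b"shell32.dll",
    "Fa": b"advapi32.dll",
    "Fi": b"InterlockedIncrement",
    "Fg": b"GetThreadLocale",
}

def presence_features(body: bytes) -> dict:
    """Return 1 for each name string found anywhere in the body, else 0."""
    return {name: int(s in body) for name, s in FEATURE_STRINGS.items()}
```

For a body containing only “advapi32.dll” and “GetThreadLocale”, the result has Fa and Fg equal to one and every other feature equal to zero, matching the example above.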

FIG. 3 is a table 260 illustrating the use of function names as features, as well as alphanumeric strings described below. This table lists examples of those function names that are commonly associated with malware; many other function names are possible. Column 262 lists examples of function names (“Callxxx”) that might appear as strings within the body of suspect computer software, as well as feature names that perform a count of particular alphanumeric strings found in the software (“Countxxx”). Column 264 lists the corresponding value for each function name that is considered a feature. While the “Call” feature names will have a value of one or zero, the “Count” feature names will have any integer value depending upon the particular data.

Because many malware programs are packed, leaving only the stub of the import table or perhaps even no import table, the malware classifier will search for the name of the dynamic link library or function in the body of the suspected malware. Adding more function names or dynamic link library names as features will likely yield better classification results.

A third category of features includes alphanumeric strings commonly found in malware. These are strings identifying registry keys, passwords, games, e-mail commands, etc. that malware typically uses. The presence of a quantity of these strings in a given computer software program indicates it is more likely than not that the software is malware. For example, a string indicating that computer software has been compressed by a tool like UPX is a good indicator that the software might be malware since benign computer software seldom uses that tool. Also, malware often steals and uses the CD keys for some of the common computer games.

Examples of these strings include auto-run registry keys such as

CurrentVersion\Run

CurrentVersion\Run Services

HKLM\Windows\Software\Microsoft\CurrentVersion\Run and

HKCR\exefile\shell\open\command.

Other examples include commonly used passwords such as “administrator,” “administrateur,” “administrador,” “1234,” “password123,” “admin123,” etc.; registry keys or installation paths of games such as “Illusion Softworks\Hidden & Dangerous 2,” “Electronic Arts\EA Sports” and “Westwood\Red Alert”; SMTP commands such as “MAIL FROM:” and “RCPT TO:”; peer-to-peer application names such as “KaZaA,” “emule,” “WinMX,” “ICQ,” “MSN Messenger,” “Yahoo Messenger,” etc.; HTML syntax such as “<html>” and “<body>”; and scripting objects such as “WScript.Shell,” “Scripting.FileSystemObject,” “Word.Application,” etc.

These alphanumeric strings are considered features within the malware classifier and the value of each of these features can be either one or zero, indicating whether or not the given string exists in the body of the suspect computer software. A count of the number of times a string appears may also be a feature value. The present embodiment uses around 200 features.
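A count-valued string feature can be sketched in the same way. In the Python sketch below, the feature names follow the “Countxxx” pattern but are hypothetical, as is the grouping of several strings under one feature; the strings themselves are drawn from the examples above.

```python
# Hypothetical "Count" features, each backed by one or more example strings.
COUNT_FEATURES = {
    "CountAutoRunKey": [b"CurrentVersion\\Run"],
    "CountSmtp": [b"MAIL FROM:", b"RCPT TO:"],
    "CountPassword": [b"password123", b"admin123"],
}

def count_features(body: bytes) -> dict:
    """Count non-overlapping occurrences of each feature's strings in the body."""
    return {
        name: sum(body.count(s) for s in strings)
        for name, strings in COUNT_FEATURES.items()
    }
```

Note that a short string such as “CurrentVersion\Run” also matches inside longer strings such as “CurrentVersion\Run Services”, so a real feature definition would need to decide whether such overlaps should be counted once or twice.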

Feature extraction module 125 extracts feature values from suspect computer software 110 corresponding to features present in feature definition file 120. In the embodiment in which software 110 is in the portable executable format, it is first necessary for the extraction module to unpack the file before extracting the feature values.

FIG. 4 illustrates a list of features and their values from a real-world worm titled “WORM.BROPIA.F.” As shown in the figure, certain characteristics 310 found in the header of the portable executable file have particular feature values. Further, numerous names of dynamic link libraries 320 are found but very few function names are found (not shown). Finally, there are TFTP strings, numerous game names and a registry key 330. Using these features and their values, along with input from trained model 130, classification algorithm 140 outputs an appropriate classification label 150 of “worm.”

Classification Algorithm

Pattern classification algorithm 140 is any suitable classification algorithm. A classification algorithm is designed to learn (or to approximate) a function that maps a vector of features into one of several classes by looking at many input-output examples of the function. Any of the standard types of classification algorithms, e.g., Decision Tree, Naïve Bayes, or Neural Network may be used to implement the malware classifier. In a situation where the number of features is high, some algorithms may not be well-suited. In one specific embodiment, the present invention uses the Support Vector Machine algorithm; SVM is described in T. Joachims, Making Large-Scale SVM Learning Practical, Advances in Kernel Methods: Support Vector Learning, B. Scholkopf, C. Burges and A. Smola (ed.), MIT Press, 1999. There are many sources of SVM software available, such as: SVM Light, SVM Torch, Libsvm, SVMFu, SMO and many others that can be found at the web site “kernel-machines.org.” The present invention makes use of the SVM Light software. An online tutorial regarding the SVM algorithm is found at the web site “http://159.226.40.18/tools/support%20vector%20machine.ppt,” and documents entitled “SVM Rules of Thumb” are also available on the Internet.

FIG. 5 illustrates a hyper plane 410 used in the SVM algorithm. Briefly, the SVM algorithm creates a maximum-margin hyper plane that lies in a transformed input space. Given training samples labeled either “+” 420 or “−” 430, a maximum-margin hyper plane splits the two groups of training samples, such that the distance from the closest samples (the margin 440) to the hyper plane is maximized. In situations such as the one shown in FIG. 5, the training samples are linearly separable.
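The notion of margin can be made concrete with a short Python sketch. It assumes a hyper plane written as w·x + b = 0, a notation that is standard for SVMs but is not used in the text, and it only measures the margin of a given hyper plane; it does not perform the optimization that finds the maximum-margin one.

```python
import math

def signed_distance(w, b, x):
    """Signed distance from point x to the hyper plane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm

def margin(w, b, samples):
    """Distance from the hyper plane to the closest sample."""
    return min(abs(signed_distance(w, b, x)) for x in samples)

def classify(w, b, x):
    """Label a point by the side of the hyper plane on which it falls."""
    return "+" if signed_distance(w, b, x) > 0 else "-"
```

Among all hyper planes that separate the “+” samples from the “−” samples, the SVM algorithm selects the one for which this margin value is largest.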

FIG. 6 illustrates a situation in which the training samples are not linearly separable. Graph 460 shows a situation in which the samples are not linearly separable and can only be separated by using a curved line 465. A kernel function “f” 472 is thus used to convert the original feature space into a linearly separable one. Graph 470 shows the converted feature space in which the training samples (or rather, their converted values) are now linearly separable using line 475.

Further details regarding operation of the SVM algorithm are omitted as general use of the SVM algorithm is known to those of skill in the art.
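A toy Python example of such a conversion is sketched below. The one-dimensional samples and the mapping x → (x, x²) are invented for illustration and do not come from FIG. 6. In practice an SVM kernel computes inner products in the converted space without building the converted vectors explicitly; the explicit mapping is used here only to make the effect visible.

```python
# Invented 1-D samples: no single threshold on x separates the two groups,
# because the "+" samples lie on both sides of the "-" samples.
PLUS = [-2.0, 2.0]
MINUS = [-0.5, 0.0, 0.5]

def feature_map(x):
    """Convert a 1-D sample into a 2-D one by adding x squared as a coordinate."""
    return (x, x * x)

def classify(point, w=(0.0, 1.0), b=-2.0):
    """Linear rule in the converted space: the line x^2 = 2 separates the groups."""
    score = w[0] * point[0] + w[1] * point[1] + b
    return "+" if score > 0 else "-"

labels_plus = [classify(feature_map(x)) for x in PLUS]
labels_minus = [classify(feature_map(x)) for x in MINUS]
```

After the conversion, a straight line in the new space plays the role that the curved line played in the original space.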

Classification of Malware

In general, the classification of computer software 110 involves loading the feature definition file, using the feature extraction module to obtain feature values from the computer software, loading the trained model (the function definition) in order to initialize the classification algorithm, and passing the feature values as variables into the classification algorithm, which then outputs a classification label for the computer software.

FIG. 7 is a flow diagram describing the classification of computer software. In step 504 feature definition file 120 is loaded into the malware classifier; the choice of a particular feature definition file will depend upon which type of malware it is desired to classify. In step 508 the trained model 130, or function definition, is also loaded into the malware classifier; again, choice of a particular trained model dictates which type of malware the classifier will be able to detect and classify.

In step 512 the suspect software 110 is obtained and input into the malware classifier 100. The suspect software may originate from a wide variety of sources. By way of example, the malware classifier is integrated into an anti-spyware software product and whenever a file is accessed (i.e., opened or executed) that particular file is input to the malware classifier. In other examples a user may manually select a file or folder of files in order to classify a file, or the classifier may be used upon incoming e-mail attachments, etc.

In step 516 feature extraction module 125 extracts the features and their values from the input software using feature definition file 120. In step 520 the pattern classification algorithm (in this embodiment, the SVM Light software) accepts as input the extracted values and by use of the trained model outputs a classification label 150. In the example in which the model is trained to detect computer worms and the feature definition file contains features relevant to worms, the classification label output will be either “worm” or “normal.” For the detection of other particular types of malware using suitable models and feature definition files, the classification labels will depend on those types of malware.

Training the Classification Algorithm

FIG. 8 is a block diagram illustrating the creation of trained model 130. Training application 145 depends on feature definition file 120 and includes feature extraction module 125. Training application 145 takes both normal computer software and a particular type of malicious software as training data 160 and, after computation, outputs the trained classification function. In one particular embodiment, model 130 takes the form of a computer file. Training of the classification algorithm uses a database of positive samples (benign computer software) and negative samples (computer software that is known malware).

In the particular embodiment that makes use of the SVM Light software, there are two executable application files provided: svm_learn and svm_classify. The training application svm_learn accepts as input the feature definition file and training data 160, i.e., any number of known malicious and known benign software applications, and can be controlled by two parameters. The output of the training application provides a measurement of the effectiveness of the malware classifier.
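The SVM Light programs read samples from a text file in which each line holds a target value followed by index:value pairs with feature indices starting at one. The Python sketch below shows one way extracted feature values might be written in that format; the function name is hypothetical, zero-valued features are omitted because the format is sparse, and the use of +1 for benign and -1 for malware follows the positive/negative convention stated above.

```python
def to_svmlight_line(label: int, values) -> str:
    """Format one sample as '<target> <index>:<value> ...' with 1-based indices."""
    parts = [f"{i}:{v:g}" for i, v in enumerate(values, start=1) if v != 0]
    return " ".join([f"{label:+d}"] + parts)

# A known malware sample (negative) with features 2 and 4 set.
example_line = to_svmlight_line(-1, [0, 1, 0, 3])
```

Because enumerate visits the features in order, the indices on each line are already ascending, as the format expects.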

The first parameter accepted by the training application (“-c”) controls the trade-off between the margin and the number of misclassified samples. The value of the parameter is selected by the user and is often determined experimentally using a validation set. Larger values often lead to fewer support vectors, a larger VC dimension, a smaller training error and a smaller margin. The second parameter (“-t”) selects a kernel function. SVM Light has four predefined kernel functions: a linear function (the default), a polynomial function, an RBF function and a sigmoid function. A user may also define a custom kernel function by modifying the source code.
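For illustration, the Python sketch below assembles an svm_learn command line from these two parameters. The helper function and the file names are hypothetical; the numeric kernel codes are those documented for the “-t” option of SVM Light. The sketch only builds the argument list and does not run the program.

```python
# Kernel codes accepted by svm_learn's -t option.
KERNEL_TYPES = {"linear": 0, "polynomial": 1, "rbf": 2, "sigmoid": 3}

def svm_learn_command(train_file, model_file, c, kernel="linear"):
    """Build the argument list: svm_learn [options] example_file model_file."""
    return ["svm_learn", "-c", str(c), "-t", str(KERNEL_TYPES[kernel]),
            train_file, model_file]

# Hypothetical file names for a worm model trained with an RBF kernel.
command = svm_learn_command("worm_train.dat", "worm.model", c=1.0, kernel="rbf")
```

The resulting list could be handed to a process launcher such as subprocess.run, provided the svm_learn executable is installed and on the search path.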

The output of the training application includes the values VC (Vapnik-Chervonenkis) dimension, precision, recall, accuracy and error. The value VC dimension measures the capacity of the trained model; choosing a model with a smaller VC dimension leads to results that are less likely to be overfit. Precision is the proportion of retrieved items that are relevant, i.e., the ratio of true positives to the sum of true positives and false positives. Recall is the proportion of relevant items that are retrieved to the total number of relevant items in the data set, i.e., the ratio of true positives to the sum of true positives and false negatives. Accuracy is the proportion of correctly classified samples, i.e., the ratio of the sum of true positives and true negatives to the number of items in the data set. Error is the proportion of incorrectly classified samples, i.e., the ratio of the sum of false positives and false negatives to the number of items in the data set.
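The four ratios defined above can be written out directly. The Python sketch below computes them from raw counts of true and false positives and negatives; the function name and the sample counts are invented for illustration, and the sketch says nothing about how the training application itself derives its reported values.

```python
def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute precision, recall, accuracy and error from confusion counts."""
    total = tp + fp + tn + fn
    if total == 0:
        raise ValueError("no samples")
    return {
        # A ratio with an empty denominator is reported as 0.0 by convention.
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
        "accuracy": (tp + tn) / total,
        "error": (fp + fn) / total,
    }

# Invented counts: 1000 samples in total.
m = classification_metrics(tp=90, fp=10, tn=880, fn=20)
```

With these counts precision is 90/100, recall is 90/110, accuracy is 970/1000 and error is 30/1000; accuracy and error always sum to one.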

The values of precision, recall and error are estimates of the potential performance of the malware classifier on new samples; they are not actual measurements of performance on the training samples.

FIG. 9 is a flow diagram describing training of the classification algorithm and the creation of trained model 130. As a threshold matter, it is determined whether to create a trained model to detect worms, spyware, adware, or dialers, etc. Once it is determined for which type of malware to screen, in step 604 classification labels are determined. For example, if the model is to be trained to detect computer worms, then the possible classification labels are either “worm” or “normal.” In step 608 appropriate features relevant to the detection of a computer worm (for example) are selected and added to a feature definition file. Examples of feature definition files are presented below. In this particular embodiment, the feature definition file includes features specific to the detection of computer worms.

The selection of features relevant to the detection of a computer worm (for example) involves an engineer's experience, background knowledge and analysis of computer worms. Selection of features relevant to the detection of other types of malware also involves knowledge of that particular type of malware. Examples of suitable relevant features for three types of malware are shown below.

In step 612 training samples are collected and stored, for example, in folders on the computer. Training samples would include any number of known computer worms (e.g., on the order of thousands) as well as a wide variety of different types of benign software files and applications. The known computer worms are placed into one folder identified as containing worms, and the benign software is placed into another folder identified as such. Such organization may be performed manually or may be automated. It is preferable that the examples of benign software include many different types of software documents and software applications, including many popular applications, in order to provide a better trained model. In step 616 parameters are selected for the training application as discussed above. One technique for choosing the best parameters is simply trial and error. Once the model is trained it is then used to classify known worms and known benign applications; if the results are not good (e.g., too many false positives) then the parameters are modified and the training application is run again to produce a new model.
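The folder-based organization of step 612 can be sketched as follows. This is a minimal illustration only: the function name `collect_samples` and the directory arguments are hypothetical, and the labels “worm” and “normal” are taken from the worm example above.

```python
from pathlib import Path


def collect_samples(worm_dir, benign_dir):
    """Return (path, label) pairs for every file under the two folders.

    Files under worm_dir are labeled "worm"; files under benign_dir are
    labeled "normal". Subfolders are searched recursively.
    """
    samples = []
    for folder, label in ((Path(worm_dir), "worm"), (Path(benign_dir), "normal")):
        # Sort so that the order of samples is reproducible between runs.
        for path in sorted(folder.rglob("*")):
            if path.is_file():
                samples.append((path, label))
    return samples
```

The resulting list of labeled paths is the kind of input from which feature values and classification labels would later be derived for training.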

In step 620 training application 145 is executed using feature definition file 120 and training data 160 as input to produce trained model 130 in step 624. In step 628 measurement results are also output (as described above) and the model can be validated in step 632 by using known types of normal software and malware. Preferably, validation includes giving the malware classifier computer worms that have not been used before by the training application.
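When the training application is SVM Light, as in the worm example described in this specification, each training sample is supplied as one text line consisting of a target value (+1 or -1) followed by `index:value` pairs with indices starting at 1 and in increasing order. The sketch below shows one way such a line might be produced from a list of feature values; the function name `to_svmlight_line` and the choice to omit zero-valued features are assumptions made for illustration.

```python
def to_svmlight_line(is_malware, values):
    """Format one sample as an SVM Light training line.

    is_malware: True for the malware class (+1), False for benign (-1).
    values: feature values in the order given by the feature definition file.
    """
    target = "+1" if is_malware else "-1"
    # Feature indices are 1-based; zero values are skipped (sparse format).
    pairs = [f"{index}:{value:g}"
             for index, value in enumerate(values, start=1)
             if value != 0]
    return " ".join([target] + pairs)
```

For example, a worm sample with feature values 0.5, 0 and 1 would be written as a line beginning with `+1` followed by entries for features 1 and 3 only.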

Worm Classification Example

The following example describes feature selection, training parameters and results for a malware classifier designed to detect computer worms. As mentioned above, the three categories of features selected are characteristics of the software, commonly used dynamic link libraries and function names, and strings commonly seen in computer worms.

FIGS. 10A-10F show portions of feature definition file 120 that include the above categories of features pertaining to computer worms. FIG. 10A shows characteristics 704 found in the header of a portable executable format file. FIGS. 10B, 10C and 10D show features representing commonly used dynamic link libraries and function names. As shown, reference numerals 708-752 list particular function names along with their associated dynamic link library names. FIGS. 10E and 10F show features of the feature definition file corresponding to strings commonly seen in computer worms. Shown are registry keys 756, common passwords 760, commonly accessed games 764, an HTTP string 768, a script string 772, commonly used e-mail strings 776, HTML strings 780, strings commonly present in malicious Windows batch files 784, and commonly used peer-to-peer strings 788.

Before training the model the feature values are first normalized, i.e., the values are transformed so that they fall between 0 and 1. With the default linear kernel function the results are quite good; true positives are around 90% and false positives are around 0.5%. Use of a polynomial function provides even better results but at the expense of a larger value for VC dimension. The training data includes about 2,000 known computer worms and about 7,000 normal (i.e., benign) software applications. Three different models are trained: a model using a polynomial kernel function, a model using a linear kernel function, and a model using a polynomial kernel function with a low false-positive rate. Once the models are trained, the classification application svm_classify is used to validate the results against the training data.
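The normalization step can be illustrated with a simple min-max scaling. This is a sketch under stated assumptions: the specification says only that values are transformed to fall between 0 and 1; the use of a per-feature lower and upper bound (such as the `range` attributes in the feature definition files reproduced later) and the clipping of out-of-range values are assumptions made here. The function name `normalize` is hypothetical.

```python
def normalize(value, lower, upper):
    """Scale value into [0, 1] using the bounds lower and upper.

    Values outside the bounds are clipped first, so the result is
    always between 0 and 1 inclusive.
    """
    if upper == lower:
        # Degenerate range: no spread to scale over.
        return 0.0
    clipped = min(max(value, lower), upper)
    return (clipped - lower) / (upper - lower)
```

For instance, with bounds of 0 and 30, a raw value of 15 maps to 0.5, and any value above 30 maps to 1.0.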

The particular command line and parameters used in training SVM Light with a polynomial function are: svm_learn.exe -c 0.01 -t 1 -d 2

This particular model resulted in true positives of about 92.54% and false positives of about 0.01138%. The VC dimension value is 8288. The general XiAlpha estimation is an error value of less than or equal to 7.15%, the recall value is greater than or equal to 95.06%, and the precision value is greater than or equal to 96.04%.

The particular command line and parameters used in training SVM Light with a linear function are: svm_learn.exe -c 0.01 -t 0

This particular model resulted in true positives of about 86.03% and false positives of about 0.17078%. The VC dimension value is 243. The general XiAlpha estimation is an error value of less than or equal to 6.27%, the recall value is greater than or equal to 97.31%, and the precision value is greater than or equal to 95.03%.

The particular command line and parameters used in training SVM Light with the second polynomial function are: svm_learn.exe -c 0.01 -dj10

This model is generated by choosing a large margin and a high penalty on false positives. This particular model results in true positives of about 82.61% and false positives of about 0%. The VC dimension value is 368. The general XiAlpha estimation is an error value of less than or equal to 6.98%, the recall value is greater than or equal to 98.93%, and the precision value is greater than or equal to 92.88%.

The above-mentioned models were all generated using computer worm samples from 2004. Once the models have been trained and validated, each classifier is tested against new data. In order to determine how effective each model is against unknown computer worms, each classifier was tested against computer worms discovered in January 2005. The results indicate that the true-positive rate is 91.53% for the linear kernel model, 91.53% for the polynomial kernel model, and 84.15% for the polynomial kernel model with a low false-positive rate. In order to check false positives, each classifier was run on several personal computers and directed to classify all files found in a portable executable format. The resulting false-positive rates were between 0.1% and 0.23%.

FIG. 11 is an example showing a trained model 810 output by the training application for the purposes of detecting a computer worm. Shown are parameters 820 used in the creation of the model, information regarding the classifier 830 including use of a linear function, and a string of parameter values 840 used for training.

Computer System Embodiment

FIGS. 12A and 12B illustrate a computer system 900 suitable for implementing embodiments of the present invention. FIG. 12A shows one possible physical form of the computer system. Of course, the computer system may have many physical forms including an integrated circuit, a printed circuit board, a small handheld device (such as a mobile telephone or PDA), a personal computer or a supercomputer. Computer system 900 includes a monitor 902, a display 904, a housing 906, a disk drive 908, a keyboard 910 and a mouse 912. Disk 914 is a computer-readable medium used to transfer data to and from computer system 900.

FIG. 12B is an example of a block diagram for computer system 900. Attached to system bus 920 are a wide variety of subsystems. Processor(s) 922 (also referred to as central processing units, or CPUs) are coupled to storage devices including memory 924. Memory 924 includes random access memory (RAM) and read-only memory (ROM). As is well known in the art, ROM acts to transfer data and instructions uni-directionally to the CPU and RAM is used typically to transfer data and instructions in a bi-directional manner. Both of these types of memories may include any suitable type of the computer-readable media described below. A fixed disk 926 is also coupled bi-directionally to CPU 922; it provides additional data storage capacity and may also include any of the computer-readable media described below. Fixed disk 926 may be used to store programs, data and the like and is typically a secondary storage medium (such as a hard disk) that is slower than primary storage. It will be appreciated that the information retained within fixed disk 926 may, in appropriate cases, be incorporated in standard fashion as virtual memory in memory 924. Removable disk 914 may take the form of any of the computer-readable media described below.

CPU 922 is also coupled to a variety of input/output devices such as display 904, keyboard 910, mouse 912 and speakers 930. In general, an input/output device may be any of: video displays, track balls, mice, keyboards, microphones, touch-sensitive displays, transducer card readers, magnetic or paper tape readers, tablets, styluses, voice or handwriting recognizers, biometrics readers, or other computers. CPU 922 optionally may be coupled to another computer or telecommunications network using network interface 940. With such a network interface, it is contemplated that the CPU might receive information from the network, or might output information to the network in the course of performing the above-described method steps. Furthermore, method embodiments of the present invention may execute solely upon CPU 922 or may execute over a network such as the Internet in conjunction with a remote CPU that shares a portion of the processing.

In addition, embodiments of the present invention further relate to computer storage products with a computer-readable medium that has computer code thereon for performing various computer-implemented operations. The media and computer code may be those specially designed and constructed for the purposes of the present invention, or they may be of the kind well known and available to those having skill in the computer software arts. Examples of computer-readable media include, but are not limited to: magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and holographic devices; magneto-optical media such as floptical disks; and hardware devices that are specially configured to store and execute program code, such as application-specific integrated circuits (ASICs), programmable logic devices (PLDs) and ROM and RAM devices. Examples of computer code include machine code, such as produced by a compiler, and files containing higher level code that are executed by a computer using an interpreter.

Although the foregoing invention has been described in some detail for purposes of clarity of understanding, it will be apparent that certain changes and modifications may be practiced within the scope of the appended claims. Therefore, the described embodiments should be taken as illustrative and not restrictive, and the invention should not be limited to the details given herein but should be defined by the following claims and their full scope of equivalents.

Spyware Feature Definition File Example

As previously mentioned, the present invention may be used to detect a wide variety of types of malware. Listed below is an example feature definition file for detecting spyware.

<?xml version="1.0"?>
<mdxml version="0.1">
 <metadata version="0.1.0.medium">
  <feature-set name="pe-header">
   <feature name="pe/number-of-sections" range="0,30"/>
   <feature name="pe/length" range="0,3978672"/>
   <feature name="pe/subsystem" categories="0,1,2,3,5,7"/>
   <feature name="pe/code-size"/>
   <feature name="pe/initialized-data-size"/>
   <feature name="pe/uninitialized-data-size"/>
   <feature name="pe/entry-point-location"/>
   <feature name="pe/image-base" categories="0x01000000, 0x00400000"/>
   <feature name="pe/import-table-size"/>
   <feature name="pe/resource-table-size"/>
   <feature name="pe/count-imported-dlls" range="0,47"/>
   <feature name="pe/count-imported-functions" range="0,1000"/>
   <feature name="pe/shell" categories="none,upx,aspack"/>
  </feature-set>
  <feature-set name="spyware-dll-usage">
   <feature name="dll/kernel32.dll" range="0,826"/>
   <feature name="dll/user32.dll" range="0,695"/>
   <feature name="dll/advapi32.dll" range="0,565"/>
   <feature name="dll/shell32.dll" range="0,406"/>
   <feature name="dll/ole32.dll" range="0,304"/>
   <feature name="dll/gdi32.dll" range="0,543"/>
   <feature name="dll/oleaut32.dll" range="0,360"/>
   <feature name="dll/wininet.dll" range="0,208"/>
   <feature name="dll/comctl32.dll" range="0,82"/>
   <feature name="dll/msvcrt.dll" range="0,779"/>
   <feature name="dll/rasapi32.dll" range="0,145"/>
   <feature name="dll/version.dll" range="0,16"/>
   <feature name="dll/comdlg32.dll" range="0,26"/>
   <feature name="dll/wsock32.dll" range="0,75"/>
   <feature name="dll/mfc42.dll" range="0,6933"/>
   <feature name="dll/rpcrt4.dll" range="0,471"/>
   <feature name="dll/shlwapi.dll" range="0,749"/>
   <feature name="dll/urlmon.dll" range="0,77"/>
   <feature name="dll/ws2_32.dll" range="0,109"/>
   <feature name="dll/msvbvm60.dll" range="0,634"/>
   <feature name="dll/winspool.drv" range="0,167"/>
   <feature name="dll/winmm.dll" range="0,198"/>
   <feature name="dll/lz32.dll" range="0,13"/>
  </feature-set>
  <feature-set name="spyware-function-usage">
   <feature name="api/GetProcAddress"/>
   <feature name="api/ExitProcess"/>
   <feature name="api/LoadLibraryA"/>
   <feature name="api/RegCloseKey"/>
   <feature name="api/GetModuleHandleA"/>
   <feature name="api/CloseHandle"/>
   <feature name="api/GetModuleFileNameA"/>
   <feature name="api/WriteFile"/>
   <feature name="api/GetLastError"/>
   <feature name="api/GetCommandLineA"/>
   <feature name="api/MultiByteToWideChar"/>
   <feature name="api/CreateFileA"/>
   <feature name="api/GetStartupInfoA"/>
   <feature name="api/WideCharToMultiByte"/>
   <feature name="api/SetFilePointer"/>
   <feature name="api/VirtualAlloc"/>
   <feature name="api/ReadFile"/>
   <feature name="api/VirtualFree"/>
   <feature name="api/RegQueryValueExA"/>
   <feature name="api/RtlUnwind"/>
   <feature name="api/GetFileType"/>
   <feature name="api/GetStdHandle"/>
   <feature name="api/lstrlenA"/>
   <feature name="api/MessageBoxA"/>
   <feature name="api/FreeLibrary"/>
   <feature name="api/RegOpenKeyExA"/>
   <feature name="api/CoInitialize"/>
   <feature name="api/lstrcpyA"/>
   <feature name="api/Sleep"/>
   <feature name="api/GetCurrentProcess"/>
   <feature name="api/ShellExecuteA"/>
   <feature name="api/InitializeCriticalSection"/>
   <feature name="api/LeaveCriticalSection"/>
   <feature name="api/EnterCriticalSection"/>
   <feature name="api/RegSetValueExA"/>
   <feature name="api/DeleteCriticalSection"/>
   <feature name="api/SetEndOfFile"/>
   <feature name="api/HeapAlloc"/>
   <feature name="api/HeapFree"/>
   <feature name="api/SendMessageA"/>
   <feature name="api/GetVersionExA"/>
   <feature name="api/GetCurrentThreadId"/>
   <feature name="api/DeleteFileA"/>
   <feature name="api/GetDC"/>
   <feature name="api/RaiseException"/>
   <feature name="api/LocalFree"/>
   <feature name="api/UnhandledExceptionFilter"/>
   <feature name="api/HeapDestroy"/>
   <feature name="api/TerminateProcess"/>
   <feature name="api/GetFileSize"/>
   <feature name="api/GetCPInfo"/>
   <feature name="api/HeapCreate"/>
   <feature name="api/GetACP"/>
   <feature name="api/HeapReAlloc"/>
   <feature name="api/GetVersion"/>
   <feature name="api/GetEnvironmentStrings"/>
   <feature name="api/SetHandleCount"/>
   <feature name="api/GetStringTypeW"/>
   <feature name="api/SetTimer"/>
   <feature name="api/InterlockedDecrement"/>
   <feature name="api/wsprintfA"/>
   <feature name="api/GetOEMCP"/>
   <feature name="api/ShowWindow"/>
   <feature name="api/GetStringTypeA"/>
   <feature name="api/LCMapStringA"/>
   <feature name="api/FreeEnvironmentStringsA"/>
   <feature name="api/CoCreateInstance"/>
   <feature name="api/FreeEnvironmentStringsW"/>
   <feature name="api/GetEnvironmentStringsW"/>
   <feature name="api/RegCreateKeyExA"/>
   <feature name="api/LCMapStringW"/>
   <feature name="api/DispatchMessageA"/>
   <feature name="api/InterlockedIncrement"/>
   <feature name="api/CreateThread"/>
   <feature name="api/InternetOpenA"/>
   <feature name="api/CreateDirectoryA"/>
   <feature name="api/SetWindowPos"/>
   <feature name="api/WaitForSingleObject"/>
   <feature name="api/DefWindowProcA"/>
   <feature name="api/DeleteObject"/>
   <feature name="api/GetClientRect"/>
   <feature name="api/FindFirstFileA"/>
   <feature name="api/lstrcpynA"/>
   <feature name="api/RegDeleteKeyA"/>
   <feature name="api/LocalAlloc"/>
   <feature name="api/FindClose"/>
   <feature name="api/PostQuitMessage"/>
   <feature name="api/GetWindowTextA"/>
   <feature name="api/BitBlt"/>
   <feature name="api/TranslateMessage"/>
   <feature name="api/lstrcatA"/>
   <feature name="api/EnableWindow"/>
   <feature name="api/CreateWindowExA"/>
   <feature name="api/LoadIconA"/>
   <feature name="api/GetTempPathA"/>
   <feature name="api/GetWindowRect"/>
   <feature name="api/GetLocaleInfoA"/>
   <feature name="api/FlushFileBuffers"/>
   <feature name="api/IsWindow"/>
   <feature name="api/CreateProcessA"/>
   <feature name="api/DestroyWindow"/>
   <feature name="api/LoadCursorA"/>
   <feature name="api/SelectObject"/>
   <feature name="api/SetStdHandle"/>
   <feature name="api/SetWindowTextA"/>
   <feature name="api/GetDlgItem"/>
   <feature name="api/PostMessageA"/>
   <feature name="api/lstrcmpiA"/>
   <feature name="api/RegDeleteValueA"/>
   <feature name="api/SetWindowLongA"/>
   <feature name="api/GetTickCount"/>
   <feature name="api/GetWindowsDirectoryA"/>
   <feature name="api/RasDialA"/>
   <feature name="api/GetMessageA"/>
   <feature name="api/KillTimer"/>
   <feature name="api/GetThreadLocale"/>
   <feature name="api/CharNextA"/>
   <feature name="api/SetFocus"/>
   <feature name="api/GetStockObject"/>
   <feature name="api/CreateSolidBrush"/>
   <feature name="api/GetWindowLongA"/>
   <feature name="api/CopyFileA"/>
   <feature name="api/GetSystemMetrics"/>
   <feature name="api/EndDialog"/>
   <feature name="api/VirtualQuery"/>
   <feature name="api/LoadLibraryExA"/>
   <feature name="api/SetTextColor"/>
   <feature name="api/EndPaint"/>
   <feature name="api/TlsGetValue"/>
   <feature name="api/BeginPaint"/>
   <feature name="api/TlsSetValue"/>
   <feature name="api/CoUninitialize"/>
   <feature name="api/SetBkMode"/>
   <feature name="api/LoadStringA"/>
   <feature name="api/GlobalAlloc"/>
   <feature name="api/VerQueryValueA"/>
   <feature name="api/GetLocalTime"/>
   <feature name="api/DeleteDC"/>
   <feature name="api/GetSysColor"/>
   <feature name="api/GetDeviceCaps"/>
   <feature name="api/FindResourceA"/>
   <feature name="api/HeapSize"/>
   <feature name="api/CreateCompatibleDC"/>
   <feature name="api/GetEnvironmentVariableA"/>
   <feature name="api/InvalidateRect"/>
   <feature name="api/LoadResource"/>
   <feature name="api/FillRect"/>
   <feature name="api/SetLastError"/>
   <feature name="api/IsBadReadPtr"/>
   <feature name="api/GetParent"/>
   <feature name="api/DialogBoxParamA"/>
   <feature name="api/CreateMutexA"/>
   <feature name="api/SystemParametersInfoA"/>
   <feature name="api/SetUnhandledExceptionFilter"/>
   <feature name="api/CompareStringA"/>
   <feature name="api/PeekMessageA"/>
   <feature name="api/InternetCloseHandle"/>
   <feature name="api/IsBadCodePtr"/>
   <feature name="api/IsBadWritePtr"/>
   <feature name="api/SetBkColor"/>
   <feature name="api/RemoveDirectoryA"/>
   <feature name="api/GetObjectA"/>
   <feature name="api/GlobalFree"/>
   <feature name="api/TlsAlloc"/>
   <feature name="api/FindWindowA"/>
   <feature name="api/GetDesktopWindow"/>
   <feature name="api/FindNextFileA"/>
   <feature name="api/SetForegroundWindow"/>
   <feature name="api/GetSystemTime"/>
   <feature name="api/InternetReadFile"/>
   <feature name="api/SizeofResource"/>
   <feature name="api/GetWindow"/>
   <feature name="api/EnumWindows"/>
   <feature name="api/GetCurrentProcessId"/>
   <feature name="api/ReleaseDC"/>
   <feature name="api/RegisterClassA"/>
   <feature name="api/lstrcmpA"/>
   <feature name="api/GetShortPathNameA"/>
   <feature name="api/GetFileAttributesA"/>
   <feature name="api/GlobalUnlock"/>
   <feature name="api/GlobalLock"/>
   <feature name="api/SendDlgItemMessageA"/>
   <feature name="api/lstrlenW"/>
   <feature name="api/GetSystemDirectoryA"/>
   <feature name="api/MulDiv"/>
   <feature name="api/GetTempFileNameA"/>
   <feature name="api/SetDlgItemTextA"/>
   <feature name="api/GetTimeZoneInformation"/>
   <feature name="api/GetFileVersionInfoA"/>
   <feature name="api/DrawTextA"/>
   <feature name="api/CreateFontIndirectA"/>
   <feature name="api/GetClassNameA"/>
   <feature name="api/UpdateWindow"/>
   <feature name="api/exit"/>
   <feature name="api/CoTaskMemFree"/>
   <feature name="api/LoadImageA"/>
   <feature name="api/SetEnvironmentVariableA"/>
   <feature name="api/GetDlgItemTextA"/>
   <feature name="api/CreateFontA"/>
   <feature name="api/GetSystemInfo"/>
   <feature name="api/CompareStringW"/>
   <feature name="api/free"/>
   <feature name="api/CallWindowProcA"/>
   <feature name="api/RegOpenKeyA"/>
   <feature name="api/OpenProcess"/>
   <feature name="api/FormatMessageA"/>
   <feature name="api/ScreenToClient"/>
   <feature name="api/RegEnumValueA"/>
   <feature name="api/RegEnumKeyExA"/>
   <feature name="api/SHGetSpecialFolderPathA"/>
  </feature-set>
  <feature-set name="spyware-full">
   <feature-set-ref name="pe-header"/>
   <feature-set-ref name="spyware-dll-usage"/>
   <feature-set-ref name="spyware-function-usage"/>
  </feature-set>
 </metadata>
</mdxml>
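The way a combined feature set such as “spyware-full” refers to other feature sets can be illustrated with a short parsing sketch. This is not the feature extraction module of the invention; it is a minimal example, assuming the element and attribute names seen in the listing above (`feature-set`, `feature`, `feature-set-ref`, `name`) and using a much-shortened sample document. The function name `expand_feature_set` and the constant `SAMPLE` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Shortened sample in the style of the listing above (assumed structure).
SAMPLE = """<?xml version='1.0'?>
<mdxml version="0.1">
 <metadata version="0.1.0.medium">
  <feature-set name="pe-header">
   <feature name="pe/number-of-sections" range="0,30"/>
   <feature name="pe/code-size"/>
  </feature-set>
  <feature-set name="spyware-dll-usage">
   <feature name="dll/kernel32.dll" range="0,826"/>
  </feature-set>
  <feature-set name="spyware-full">
   <feature-set-ref name="pe-header"/>
   <feature-set-ref name="spyware-dll-usage"/>
  </feature-set>
 </metadata>
</mdxml>"""


def expand_feature_set(root, set_name, _path=()):
    """Return the feature names of a feature set, following references."""
    if set_name in _path:
        raise ValueError(f"circular feature-set reference: {set_name}")
    node = next((fs for fs in root.iter("feature-set")
                 if fs.get("name") == set_name), None)
    if node is None:
        raise KeyError(set_name)
    names = []
    for child in node:
        if child.tag == "feature":
            names.append(child.get("name"))
        elif child.tag == "feature-set-ref":
            # Recurse into the referenced set, tracking the path taken.
            names.extend(expand_feature_set(root, child.get("name"),
                                            _path + (set_name,)))
    return names
```

Calling `expand_feature_set(ET.fromstring(SAMPLE), "spyware-full")` yields the features of the referenced sets in document order, which gives one fixed ordering that could be used to index feature values.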

Dialer Feature Definition File Example

As previously mentioned, the present invention may be used to detect a wide variety of types of malware. Listed below is an example feature definition file for detecting dialer malware. One of skill in the art, upon a reading of the specification and the examples contained herein, would be able to use the invention to detect a variety of other types of malware.

<?xml version="1.0" encoding="iso-8859-1"?>
<mdxml version="0.1">
 <metadata version="0.1">
  <!-- Searching for web browsers -->
  <feature-set name="web-browser">
   <feature name="match/opera.exe"/>
   <feature name="match/netscape.exe"/>
   <feature name="match/iexplore.exe"/>
   <feature name="match/Internet Explorer"/>
   <feature name="match/Netscape"/>
   <feature name="match/Opera"/>
   <feature name="match/network.proxy.type"/>
   <feature name="match/nsreg.dat"/>
  </feature-set>
  <!-- Access to web browsers' settings -->
  <feature-set name="browser-hack" mode="postfix">
   <feature name="match/Software\Microsoft\Internet Explorer\Toolbar\WebBrowser"/>
   <feature name="match/Software\Microsoft\Windows\CurrentVersion\Internet Settings"/>
   <feature name="match/Software\Netscape\Netscape Navigator\Users"/>
   <feature name="match/Software\Netscape\Netscape Navigator\biff"/>
   <feature name="match/Software\Netscape\Netscape Navigator\Main"/>
   <feature name="match/Software\Microsoft\Internet Explorer\Main"/>
   <feature name="match/http\shell\open\command"/>
   <feature name="match/htmlfile\shell\open\command"/>
  </feature-set>
  <!-- List of countries -->
  <feature-set name="countries" mode="exact">
   <feature name="match/countries" pattern="
    Domestic Premium,DiggoGarcia,Licthenstein,Solomon Island,Domestic UK,
    Norfolk Island,Domestic Switzerland,Domestic Spain,Central Africa,
    Domestic New Zealand,NZ Mobiel,United Kingdom,Domestic Italy,Cook Island,
    Domestic Germany, Domestic Call,Lichtenstein,Nauru,Sao Tome,
    Domestic Belgium,Diego Garcia,Domestic Austria,Domestic Australia,
    NZ Mobile,Zimbabwe,Yemen,Venezuela,Uruguay,Ukraine,U.A.E,Turkey,
    Tunisia,Thailand,Taiwan,Syria,Switzerland,Sweden,Spain,South Africa,
    Slovenia,Slovak Republic,Singapore,Serbia, Saudi Arabia,Russia,Romania,
    Qatar,Portugal,Poland,Philippines,Paraguay,Panama,Pakistan,Norway,
    Nicaragua,New Zealand,Netherlands,Morocco,Monaco,Mexico, Malaysia,
    Macedonia,Macau,Luxembourg,Lithuania,Liechtenstein,Libya,Lebanon,Latvia,
    Kuwait,Korea South, Korea North,Kenya,Kazakhstan,Jordan,Japan,Jamaica,
    Italy,Israel,Ireland,Indonesia,Indian,Iceland,Hungary,Hong Kong,
    Honduras,Guatemala,Greenland,Greece,Germany,Georgia,France,Finland,
    Faeroe Islands,Estonia,El Salvador,Egypt,Ecuador,Dominica,Denmark,
    Czech Republic,Croatia,Costa Rica,Colombia,China,Chile,Canada,Bulgaria,
    Brunei,Brazil,Bolivia,Belize,,Belgium,Belarus,Barbados, Bahrain,Austria,
    Australia,Aruba,Armenia,Argentina,Algeria,Albania,Kiribati"/>
  </feature-set>
  <!-- TAPI functions -->
  <feature-set name="tapi" mode="winapi">
   <feature name="match/tapi32.dll" mode="exact"/>
   <feature name="match/lineClose"/>
   <feature name="match/lineGetDevCaps"/>
   <feature name="match/lineInitializeEx"/>
   <feature name="match/lineNegotiateAPIVersion"/>
   <feature name="match/lineOpen"/>
   <feature name="match/lineShutdown"/>
  </feature-set>
  <!-- RASAPI functions -->
  <feature-set name="rasapi" mode="winapi">
   <feature name="match/rasapi32.dll" mode="exact"/>
   <feature name="match/RasEnumDevices"/>
   <feature name="match/RasEnumConnections"/>
   <feature name="match/RasGetConnectStatus"/>
   <feature name="match/RasHangUp"/>
   <feature name="match/RasDial"/>
   <feature name="match/RasSetEntryDialParams"/>
   <feature name="match/RasDeleteEntry"/>
   <feature name="match/RasSetEntryProperties"/>
  </feature-set>
  <!-- WinInet functions -->
  <feature-set name="wininet" mode="winapi">
   <feature name="match/InternetOpen"/>
  </feature-set>
  <!-- Registry keys for automatic start-up -->
  <feature-set name="auto-startup" mode="postfix">
   <feature name="match/CurrentVersion\Explorer\Shell Folders"/>
   <feature name="match/CurrentVersion\Explorer\User Shell Folders"/>
   <feature name="match/CurrentVersion\Run"/>
   <feature name="match/CurrentVersion\RunOnce"/>
   <feature name="match/CurrentVersion\RunServices"/>
   <feature name="match/CurrentVersion\RunServicesOnce"/>
   <feature name="match/txtfile\shell\open\command"/>
   <feature name="match/exefile\shell\open\command"/>
  </feature-set>
  <!-- Dialer detecting -->
  <feature-set name="dialer-full" public="true">
   <feature name="pe/packed"/>
   <feature name="match/http://" mode="prefix"/>
   <feature name="match/dialer" mode="word"/>
   <feature name="match/disconnect" mode="word"/>
   <feature name="match/wait" mode="word"/>
   <feature name="match/hangup" pattern="hangup,hang up" mode="word"/>
   <feature name="match/authentication" pattern="authentication,authenticate"/>
   <feature name="match/adult" mode="word"/>
   <feature name="match/eighteen" pattern="eighteen,18" mode="word"/>
   <feature name="match/bill" mode="word"/>
   <feature name="match/modem" mode="word"/>
   <feature name="match/CurrentControlSet\Services\Tcpip\Parameters" mode="postfix"/>
   <feature name="match/drivers\etc\hosts" mode="postfix"/>
   <feature-set-ref name="web-browser"/>
   <feature-set-ref name="browser-hack"/>
   <feature-set-ref name="countries"/>
   <feature-set-ref name="tapi"/>
   <feature-set-ref name="rasapi"/>
   <feature-set-ref name="wininet"/>
   <feature-set-ref name="auto-startup"/>
  </feature-set>
 </metadata>
</mdxml>

1. A method of training a malware classifier, said method comprising: determining a classification label that represents a type of malware, said type of malware not including benign software; determining a classification label that represents a second type of malware; creating a feature definition file that includes first features relevant to the classification of said type of malware and that includes second features relevant to the classification of said second type of malware, wherein said first and second features are combined into one feature set in said feature definition file, wherein said features include characteristics of said type of malware, DLL names and function names executed by said type of malware, and alphanumeric strings used by said type of malware; selecting software training data including software of the same type as said type of malware and software that is benign; executing a training application on a computer associated with said malware classifier and inputting said feature definition file and said software training data into said training application; and outputting a training model associated with said malware classifier on said computer, whereby said training model is arranged to assist in the identification of said type of malware and said second type of malware.
2. A method as recited in claim 1 wherein said type of malware is a virus, a worm, a Trojan horse, a dropper, a wabbit, a fork bomb, spyware, adware, a backdoor, ratware, an exploit, a root kit, key logger software, a dialer or URL injection software.
3. A method as recited in claim 1 wherein said type of malware is a worm, spyware or a dialer.
4. A method as recited in claim 1 wherein said characteristics of said type of malware include header fields.
5. A method as recited in claim 4 wherein header fields include a packed field, a number of sections field, a code size field, an import table size field, and a resource table size field, all of a portable executable format.
6. A method as recited in claim 1 wherein said malware classifier is based on the support vector machine (SVM) algorithm.
7. A method as recited in claim 1 further comprising: validating said training model by using as input into said malware classifier a previously un-used software program of said type of malware, said software program not being included in said software training data; and outputting said classification label for said previously un-used input software program indicating said type of malware.
8. A method as recited in claim 1 wherein executing a training application further comprises: using a first parameter that controls a trade-off between a margin and one or more misclassified samples and a second parameter for selecting a kernel function.
9. A method of classifying a suspect software program, said method comprising: selecting a group of features relevant to the identification of a particular type of malware, wherein said particular type of malware does not include benign software and wherein said group of features include characteristics of said type of malware, DLL names and function names executed by said type of malware, and alphanumeric strings used by said type of malware; selecting a second group of features relevant to the identification of a second particular type of malware; combining said first and second groups of features into one selected feature set; selecting a trained model, said trained model being trained to identify said particular type of malware and said second particular type of malware; extracting a subset of said first and second features and their corresponding values from said suspect software program utilizing said selected feature set; executing a classification algorithm on a computer and inputting said subset of features, said corresponding values, and said trained model, wherein said classification algorithm combines logic of classification functions for detecting said type of malware and said second type of malware; and outputting a classification label using said computer for said suspect software program that identifies said type of malware or said second type of malware.
10. A method as recited in claim 9 wherein said type of malware is a virus, a worm, a Trojan horse, a dropper, a wabbit, a fork bomb, spyware, adware, a backdoor, ratware, an exploit, a root kit, key logger software, a dialer or URL injection software.
11. A method as recited in claim 9 wherein said type of malware is a worm, spyware or a dialer.
12. A method as recited in claim 9 wherein said characteristics of said type of malware include header fields.
13. A method as recited in claim 12 wherein header fields include a packed field, a number of sections field, a code size field, an import table size field, and a resource table size field, all of a portable executable format.
14. A method as recited in claim 9 wherein said classification algorithm is based on the support vector machine (SVM) algorithm.
15. A method as recited in claim 8 further comprising: performing said steps of claim 8 using a malware classifier, wherein said malware classifier is integrated into an anti-spyware software product; and inputting said subset of features into said classification algorithm when said software program is accessed.
16. A malware classifier apparatus implemented on a computer for classifying suspect software, said malware classifier comprising: a feature definition file including first features relevant to the identification of a type of malware and second features relevant to the identification of a second particular type of malware, wherein said first and second features are combined into one feature set in said feature definition file, said type of malware not including benign software and, wherein said features include characteristics of said type of malware, DLL names and function names executed by said type of malware, and alphanumeric strings used by said type of malware; a trained model, said model being trained to identify said type of malware and said second particular type of malware; a feature extraction module arranged to accept as input computer software and said feature definition file and to extract a subset of said first and second features and their values from said computer software using a computer; a pattern classification algorithm that accepts said subset of features and their values and uses said trained model to output a classification label using said computer for said input computer software that identifies said type of malware or said second type of malware, wherein said pattern classification algorithm combines logic of classification functions for detecting said type of malware and said second type of malware.
17. The malware classifier apparatus as recited in claim 16 wherein said type of malware is a virus, a worm, a Trojan horse, a dropper, a wabbit, a fork bomb, spyware, adware, a backdoor, ratware, an exploit, a root kit, key logger software, a dialer or URL injection software.
18. The malware classifier apparatus as recited in claim 16 wherein said type of malware is a worm, spyware or a dialer.
19. The malware classifier apparatus as recited in claim 16 wherein said characteristics of said type of malware include header fields.
20. The malware classifier apparatus as recited in claim 19 wherein header fields include a packed field, a number of sections field, a code size field, an import table size field, and a resource table size field, all of a portable executable format.
21. The malware classifier apparatus as recited in claim 16 wherein said pattern classification algorithm is based on the support vector machine (SVM) algorithm.
22. The malware classifier apparatus as recited in claim 16 wherein said malware classifier apparatus is integrated into an anti-spyware software product, and wherein said computer software is input when said computer software is accessed.